.. and a fair bit of statistical learning
2 Jun 2025
I owe a debt of gratitude to many people, as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.
These materials were created by Gerko Vink, who holds the copyright. The intellectual property belongs to Utrecht University. Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.
Warning
You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:
Materials
Yesterday we learned:
Today we will learn how to:
Several questions involving statistics:
What is the relation between \(X\) and \(Y\)? (estimation)
What is the uncertainty around this effect? (estimation/inference)
What can I conclude about my population? (inference/hypothesis testing)
How can I best predict new observations? (prediction)
How can I show relevant patterns in my data? (dimension reduction / pattern recognition)
In supervised learning we aim to quantify the relation between \(Y\) and \(X\).
\(Y\):
\(X\):
We want to find the predictive function:
\[Y = f(X) + \epsilon \]
That minimizes \(\epsilon\) with respect to our goal.
Our aim is to find the \(f(X)\) that best represents the systematic information that \(X\) yields about \(Y\).
With supervised learning every observation on our predictor
\[x_i, i=1, \dots, n\]
has a corresponding outcome measurement
\[y_i\] such that
\[\hat{y_i}=f({\bf x_i})\quad \text{and} \quad y_i = f({\bf x_i})+\epsilon_i.\]
Examples:
With unsupervised learning we have a vector of measurements \(\bf x_i\) for every unit \(i=1, \dots, n\), but we lack the associated response \(y_i\).
There is no outcome to predict
There is no outcome to verify the model
Find patterns in \(\bf x_1, \dots, x_n\)
We can use this model to, for example, find out whether some cases are more similar than others, or which variables explain most of the variation
Examples:
Let’s create some data from a multivariate normal distribution
We start with fixing the random seed
and specifying the variance covariance matrix:
Because the variances are 1, the resulting data will have a correlation of \[\rho = \frac{\text{cov}(y, x)}{\sigma_y\sigma_x} = \frac{.5}{1\times1} = .5.\]
Let’s draw the data
We have added a new column that randomly assigns rows to level A, B or C
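A sketch of the steps above. The slides do not show the generating code, so the use of `MASS::mvrnorm()` and the means `mu = c(5, 5)` are assumptions; the variance-covariance matrix follows the \(\rho = .5\) calculation above.

```r
library(MASS)  # mvrnorm(); MASS ships with standard R installations

set.seed(123)  # fix the random seed for reproducibility
# variance-covariance matrix: variances 1, covariance .5, so rho = .5
sigma <- matrix(c(1, .5,
                  .5, 1), nrow = 2)
# draw 100 bivariate normal observations; mu = c(5, 5) is an assumption
sim.data <- as.data.frame(mvrnorm(n = 100, mu = c(5, 5), Sigma = sigma))
colnames(sim.data) <- c("x1", "x2")
# add a column that randomly assigns rows to level A, B or C
sim.data$class <- sample(c("A", "B", "C"), size = 100, replace = TRUE)

cor(sim.data$x1, sim.data$x2)  # should land in the neighbourhood of .5
```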
For every test observation \(x_0\) the \(K\) points that are close to \(x_0\) are identified.
These closest points form set \(\mathcal{N}_0\).
We estimate the probability for \(x_0\) being part of class \(j\) as the fraction of points in \(\mathcal{N}_0\) for whom the response equals \(j\): \[P(Y = j | X = x_0) = \frac{1}{K}\sum_{i\in\mathcal{N}_0}I(y_i=j)\]
The observation \(x_0\) is classified to the class with the largest probability
An observation is assigned the class to which most of its \(K\) neighbours belong
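A toy illustration of the estimate above, with hypothetical neighbour labels and \(K = 5\):

```r
# suppose the K = 5 nearest neighbours of x0 carry these observed classes
neighbours <- c("A", "A", "B", "A", "C")
# P(Y = j | X = x0) = (1/K) * sum over N_0 of I(y_i = j)
probs <- table(factor(neighbours, levels = c("A", "B", "C"))) / 5
probs
# x0 is classified to the class with the largest estimated probability
names(which.max(probs))  # "A", since 3 of the 5 neighbours are A
```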
Because \(x_0\) is assigned to the class to which most of its neighbours belong, K-NN is
non-parametric
expected to be far better than logistic regression when decision boundaries are non-linear
However, we do not get parameters as with LDA and regression.
First we need to determine a training set
set.seed(123)
sim.data <- sim.data %>%
  mutate(set = sample(c("Train", "Test"), size = 100,
                      prob = c(.25, .75), replace = TRUE))
sim.data
# A tibble: 100 × 4
x1 x2 class set
<dbl> <dbl> <chr> <chr>
1 5.90 6.13 C Test
2 8.02 6.97 C Train
3 7.07 8.19 C Test
4 3.62 5.40 A Train
5 2.72 5.89 A Train
6 7.78 7.16 C Test
7 5.42 3.71 B Test
8 6.43 8.08 C Train
9 4.97 1.73 B Test
10 4.06 6.22 A Test
# ℹ 90 more rows
Then we split the data into a training (build the model) and a test (verify the model) set
Now we can fit the K-NN model
Let’s make a new observation:
Now we predict the class of this new observation, using the entire data for training our model
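The slides omit the fitting code; a sketch using `knn()` from the `class` package (a recommended package in standard R installations), with hypothetical stand-in data for `sim.data`:

```r
library(class)  # knn(): K-nearest-neighbour classification

set.seed(123)
# hypothetical stand-in for sim.data's features and class labels
train.x <- data.frame(x1 = rnorm(100, mean = 5), x2 = rnorm(100, mean = 5))
train.y <- factor(sample(c("A", "B", "C"), size = 100, replace = TRUE))
# a new observation to classify
new.obs <- data.frame(x1 = 6, x2 = 7)
# assign new.obs the majority class among its k = 5 nearest neighbours,
# using the entire data for training
pred <- knn(train = train.x, test = new.obs, cl = train.y, k = 5)
pred
```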
K-means clustering partitions our dataset into \(K\) distinct, non-overlapping clusters or subgroups.
A set of relatively similar observations.
What counts as "similar" is up to the programmer/researcher to decide. For example, we can require that the within-cluster variance is as small as possible and the between-cluster variance as large as possible.
We expect clusters in our data, but weren’t able to measure them
We want to summarise features into a categorical variable to use in further decisions/analysis
compute the cluster centroid (colMeans) for each class
Source: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
K is a tuning parameter (centers)
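The output below comes from a call along these lines. The call itself is not shown in the slides, so this is a sketch with hypothetical stand-in data; the exact numbers depend on the data and the seed.

```r
set.seed(123)  # kmeans picks random starting centroids
# hypothetical stand-in for sim.data's two features
dat <- data.frame(x1 = rnorm(100, mean = 5), x2 = rnorm(100, mean = 5))
# K = 3 clusters; 'centers' is the tuning parameter K
fit <- kmeans(dat, centers = 3)
fit$size                   # observations per cluster
fit$centers                # cluster means
fit$betweenss / fit$totss  # proportion of variance between clusters
```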
K-means clustering with 3 clusters of sizes 39, 35, 26
Cluster means:
x1 x2
1 4.994727 3.532014
2 3.591445 6.394369
3 6.845077 6.976240
Clustering vector:
[1] 3 3 3 2 2 3 1 3 1 2 1 1 1 2 2 2 3 3 2 2 3 2 3 1 2 2 3 1 2 1 2 1 3 1 3 1 2
[38] 1 2 1 1 1 1 2 1 2 1 2 3 1 2 3 1 1 1 3 2 2 1 1 2 2 3 1 3 1 1 1 2 1 2 2 2 1
[75] 1 3 3 2 1 1 3 3 3 3 1 2 2 1 2 1 1 3 1 2 3 2 2 3 1 2
Within cluster sum of squares by cluster:
[1] 49.33578 46.51667 49.21032
(between_SS / total_SS = 73.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
Supervised learning: outcome / target available
Unsupervised learning: no outcome / target
prediction & pattern recognition vs. estimation, inference, testing
knn: nonparametric classification
kmeans: clustering algorithm
With the power of R we can generate any data we want and know the ‘truth’!
New functions:
hist(): histogram
plot(): R's plotting device
barplot(): bar plot function
boxplot(): box plot function
density(): function that calculates the density
ggplot(): ggplot's plotting device
Source: Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician, 27(1), 17–21.
Source: https://www.autodeskresearch.com/publications/samestats
base graphics in R
ggplot2 graphics
    age reg wgt hgt bmi hc gen phb tv
223 1 1 1 1 1 1 1 1 1 0
19 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 0 1 1
1 1 1 1 1 1 1 0 1 0 2
437 1 1 1 1 1 1 0 0 0 3
43 1 1 1 1 1 0 0 0 0 4
16 1 1 1 0 0 1 0 0 0 5
1 1 1 1 0 0 0 0 0 0 6
1 1 1 0 1 0 1 0 0 0 5
1 1 1 0 0 0 1 1 1 1 3
1 1 1 0 0 0 0 1 1 1 4
1 1 1 0 0 0 0 0 0 0 7
3 1 0 1 1 1 1 0 0 0 4
0 3 4 20 21 46 503 503 522 1622
plot() method
ggplot2?
Layered plotting based on the book The Grammar of Graphics by Leland Wilkinson.
With ggplot2 you
ggplot2 then takes care of the details
1: Provide the data
2: map variable to aesthetics
3: state which geometric object to display
Create the plot
Add another layer (smooth fit line)
Give it some labels and a nice look
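The three steps above can be sketched as follows, assuming the ggplot2 package and hypothetical stand-in data for the plotted variables:

```r
library(ggplot2)

set.seed(123)
dat <- data.frame(x1 = rnorm(100, mean = 5), x2 = rnorm(100, mean = 5))

p <- ggplot(dat, aes(x = x1, y = x2)) +     # 1: data, 2: aesthetics
  geom_point() +                            # 3: geometric object
  geom_smooth(method = "lm", se = FALSE) +  # another layer: smooth fit line
  labs(title = "x2 against x1",
       x = "Predictor x1",
       y = "Outcome x2") +                  # some labels
  theme_minimal()                           # a nice look
p
```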
geom_point
geom_bar
geom_line
geom_smooth
geom_histogram
geom_boxplot
geom_density
theme_minimal(), theme_classic(), theme_bw(), …
ggthemes
theme()
Gerko Vink @ Anton de Kom Universiteit, Paramaribo